Internet Info 1997 December

Internet Message Format | 1997-02-19 | 4KB

Received: (from daemon@localhost) by services.bunyip.com (8.6.10/8.6.9)
    id IAA12102 for urn-ietf-out; Thu, 31 Oct 1996 08:41:34 -0500
Received: from mocha.bunyip.com (mocha.Bunyip.Com [192.197.208.1])
    by services.bunyip.com (8.6.10/8.6.9) with SMTP id IAA12097
    for <urn-ietf@services.bunyip.com>; Thu, 31 Oct 1996 08:41:31 -0500
Received: from josef.ifi.unizh.ch by mocha.bunyip.com with SMTP
    (5.65a/IDA-1.4.2b/CC-Guru-2b) id AA21856
    (mail destined for urn-ietf@services.bunyip.com); Thu, 31 Oct 96 08:41:26 -0500
Received: from ifi.unizh.ch by josef.ifi.unizh.ch
    id <00846-0@josef.ifi.unizh.ch>; Thu, 31 Oct 1996 14:38:58 +0100
Subject: Re: [URN] New syntax draft (was: no subject)
To: gjw@wnetc.com (Gregory J. Woodhouse)
Date: Thu, 31 Oct 1996 14:38:57 +0100 (MET)
Cc: jayhawk@ds.internic.net, urn-ietf@bunyip.com
In-Reply-To: <Pine.SGI.3.95.961031050249.19154D-100000@shellx.best.com>
    from "Gregory J. Woodhouse" at Oct 31, 96 05:15:43 am
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Content-Length: 2937
From: Martin J Duerst <mduerst@ifi.unizh.ch>
Message-Id: <"josef.ifi..493:31.09.96.13.39.00"@ifi.unizh.ch>
Sender: owner-urn-ietf@services.bunyip.com
Precedence: bulk
Reply-To: Martin J Duerst <mduerst@ifi.unizh.ch>
Errors-To: owner-urn-ietf@bunyip.com

Gregory Woodhouse wrote:

>I'm also concerned because decoding requires knowledge that the string
>being decoded is UTF-8.

Not exactly. It is very rare that a string that is not UTF-8
looks like a UTF-8 string.

>I said before that I wasn't terribly excited about
>using RFC 1522 style encodings,

Me neither! "charset" is okay for long mails, and RFC 1522 was
unavoidable at the time it was created, but now we have better
solutions.

>but it seems essential in this case because
>otherwise the UA has no way of knowing (except by guessing) that the string
>is encoded UTF-8 and not just an ASCII string.

This difference is always absolutely clear!
A string that only contains bytes with the high bit '0' (7-bit bytes)
is ASCII. As soon as a high bit is set to '1', it's not ASCII anymore.
It may be UTF-8, or may not.

>It is conceivable that some
>URN schemes will allow the sequences like %20%1E or whatever, so this could
>be a real problem.

Well, %20 is a space, which is ASCII, but should be escaped.
%1E is a control character (record separator); as a byte below 0x80
it never appears inside a UTF-8 multi-byte sequence, and strictly
speaking it is not part of printable ASCII either.

>Another thought: What if someone wants to come along and
>do %hhhh style encoding of UTF-16? Will this be mistaken for encoded UTF-8?

Well, anybody can come along and propose new weird syntax rules
for URNs. Hopefully, nobody will listen to him/her. And these
rules won't comply with the official URN syntax we are working
on here.

Actually, I don't think that anybody will get the idea of using
UTF-16 when UTF-8 is proposed. It's Unicode/ISO 10646 in both
cases, the transform is easy, and people dealing with Unicode
and stuff know that things such as URNs, with ASCII backwards
compatibility requirements, are best handled using UTF-8.

The things we have to worry about more are people that come
along and say they want to use something other than Unicode/ISO 10646,
or at least want to be able to use something else, because for
some reason they are against Unicode, or they think it should
just be one of many encodings.

Also, what we have to worry about are cases where it's difficult
to know what native character encoding is used, so that it is
difficult to convert to Unicode and UTF-8 even if you wanted to.
ftp is a typical case of this.

Anyway, even the two cases above are well handled in the draft.
We tell them that if they really need to, they can use something
other than UTF-8. Maybe we should tell them a little more clearly
that if they do that, they should not expect much support from
general tools, i.e.
their URNs will just show up as %HH%HH..., and the users will
have to type them in like this, and, by accident, some of them may
be interpreted as UTF-8 and be displayed as such, but they cannot
have their users type native characters into a browser and expect
the browser to do the right thing (because in this case it would
convert to UTF-8).

Regards,    Martin.
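
The ASCII versus UTF-8 distinction argued above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names classify and display are hypothetical and do not come from the URN syntax draft, and the fallback of showing non-UTF-8 data in its escaped form is one possible reading of "will just show up as %HH%HH...".

```python
from urllib.parse import unquote_to_bytes


def classify(escaped: str) -> str:
    """Classify a %HH-escaped string as 'ascii', 'utf-8', or 'other'."""
    raw = unquote_to_bytes(escaped)
    # Every byte has its high bit clear: the string is plain ASCII.
    if all(b < 0x80 for b in raw):
        return "ascii"
    # At least one high bit is set: it may be UTF-8, or may not.
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "other"
    return "utf-8"


def display(escaped: str) -> str:
    """Hypothetical display rule: decode ASCII/UTF-8, else keep escapes."""
    if classify(escaped) == "other":
        return escaped  # shown to the user as %HH%HH...
    return unquote_to_bytes(escaped).decode("utf-8")
```

For example, classify("D%C3%BCrst") gives "utf-8" and display returns the decoded name, while a Latin-1 escape such as "D%FCrst" is classified as "other" and stays in escaped form. Note that the check only says the bytes are well-formed UTF-8: a string in another encoding that happens to pass, such as %C3%BC intended as two Latin-1 characters, would still be displayed as UTF-8, which is the accidental interpretation mentioned above.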